What are the Most Valued Data Science Skills?

By Meaghan, Albert, Hovig, Justin, Rose and Brian

March 25, 2018

Introduction

Kaggle ML and Data Science Survey, 2017

Five months ago, Kaggle, a website that hosts competitions in which teams of data scientists compete for cash prizes, released its annual user survey. This comprehensive survey asked Kaggle members numerous questions in order to collect metrics on its user base. Our group selected this survey as our data set for determining the top hard and soft skills required of a data scientist.

Breakdown:

  • Nearly 3,000 observations, subset from a raw set of nearly 16,000 to isolate working data scientists.
  • More than 200 questions on a variety of topics.

Credit to Amber Thomas for providing the following code used for extracting and summarizing answers to multiple-choice questions.

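# dplyr supplies the verbs and the %>% pipe; tidyr supplies unnest().
library(dplyr)
library(tidyr)

# Summarise a single-choice question: count and percentage for each answer.
# Relies on a data frame named exp_df existing in the calling environment.
chooseOne = function(question){
    exp_df %>%
        dplyr::filter(.data[[question]] != "") %>% 
        dplyr::group_by(dplyr::across(dplyr::all_of(question))) %>% 
        dplyr::summarise(count = n()) %>% 
        dplyr::mutate(percent = (count / sum(count)) * 100) %>% 
        dplyr::arrange(desc(count)) 
}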

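# Summarise a multiple-choice question whose answers are stored as one
# comma-separated string per respondent.
chooseMultiple = function(question,df){
  df %>% 
    dplyr::filter(.data[[question]] != "") %>%
    dplyr::select(dplyr::all_of(question)) %>% 
    # Number of respondents who answered; used as the percentage denominator
    dplyr::mutate(totalCount = n()) %>% 
    # Split on commas, ignoring commas that sit inside parentheses
    dplyr::mutate(selections = strsplit(as.character(.data[[question]]), 
                                 '\\([^)]+,(*SKIP)(*FAIL)|,\\s*', perl = TRUE)) %>%
    tidyr::unnest(cols = selections) %>% 
    dplyr::group_by(selections) %>% 
    dplyr::summarise(totalCount = max(totalCount),
              count = n()) %>% 
    dplyr::mutate(percent = (count / totalCount) * 100) %>% 
    dplyr::arrange(desc(count))
}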
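
To make the splitting step in chooseMultiple easier to follow, here is a minimal base R sketch of what the regular expression does. The answer string is a made-up example, not taken from the survey; the pattern itself is copied from the function above.

```r
# Hypothetical multi-select answer; one option contains a comma inside parentheses.
answer <- "Python,SQL (MySQL, Postgres), R"

# Same pattern as in chooseMultiple: the first alternative consumes text from
# "(" through a comma inside the parentheses and then discards that match
# via (*SKIP)(*FAIL); the second alternative splits on a comma plus any
# whitespace that follows it.
split_pattern <- "\\([^)]+,(*SKIP)(*FAIL)|,\\s*"

selections <- strsplit(answer, split_pattern, perl = TRUE)[[1]]
print(selections)
```

The comma inside the parentheses is left alone, so the parenthesised option should stay in one piece while the other commas act as separators.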

# Same summary as chooseOne, but for a data frame passed in as df.
Academic_exploration=function(question,df){
     df %>%
        dplyr::filter(.data[[question]] != "") %>% 
        dplyr::group_by(dplyr::across(dplyr::all_of(question))) %>% 
        dplyr::summarise(count = n()) %>% 
        dplyr::mutate(percent = (count / sum(count)) * 100) %>% 
        dplyr::arrange(desc(count)) 
  }

proportion_function <- function(vec){
    vec/sum(vec)*100
}

# Convert a column to numeric and bin it into labelled, left-closed intervals.
create_breaks <- function(dfcolumn,breaks,labels){
    dfcolumn <- as.numeric(dfcolumn)
    cut(dfcolumn,breaks=breaks,labels=labels,right=FALSE)
}
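
As a rough illustration of how the last two helpers might be used together, the sketch below bins a handful of made-up ages and converts the bin counts to percentages. The ages, break points and labels are assumptions chosen for the example; the helpers are repeated only so the block runs on its own.

```r
# Copies of the two helpers defined above, repeated so this sketch is self-contained.
proportion_function <- function(vec) {
    vec / sum(vec) * 100
}

create_breaks <- function(dfcolumn, breaks, labels) {
    dfcolumn <- as.numeric(dfcolumn)
    cut(dfcolumn, breaks = breaks, labels = labels, right = FALSE)
}

# Hypothetical ages stored as text, as they might arrive from a CSV import.
ages <- c("23", "30", "35", "47")

# right = FALSE makes each bin closed on the left: [0, 30), [30, 40), [40, Inf).
age_groups <- create_breaks(ages,
                            breaks = c(0, 30, 40, Inf),
                            labels = c("Under 30", "30-39", "40+"))

# Count respondents per bin, then convert the counts to percentages.
group_counts <- table(age_groups)
group_percent <- proportion_function(group_counts)
print(group_percent)
```

Because the intervals are closed on the left, an age of exactly 30 falls into the "30-39" bin rather than "Under 30".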

Profile of a Data Scientist

Data Scientist Demographics

INTRODUCTION HERE

CONCLUSION

Learning Platform Usefulness

Usefulness of Various Learning Platforms

INTRODUCTION

CONCLUSION

Learning Categories

How Data Scientists Learned Their Core Skills

In this section, we examine how data scientists gained their skill sets. We believe that studying how successful data scientists acquired their skills may offer valuable insight into what makes a strong data scientist.

Our data show great diversity in learning styles. Not only do data scientists learn from a variety of sources, but the importance of each source varies from one data scientist to the next. This suggests that there is no single right way to learn to become a data scientist. At the same time, because the four major categories account for nearly 100% of education, there appear to be no “secret” learning sources.

It is interesting to note that nearly 75% of data scientists indicate they learned while on the job.
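
As an illustration of how figures like these might be computed, the sketch below uses a small made-up table in which each respondent splits 100% of their learning across four categories. The column names and values are assumptions for illustration only; they are not the survey's actual fields or results.

```r
# Hypothetical responses: each row is one respondent, each column the
# percentage of their training attributed to that source (rows sum to 100).
learning <- data.frame(
    self_taught    = c(40, 20, 50, 30, 25),
    online_courses = c(20, 30, 50, 10, 25),
    work           = c(30, 10,  0, 40, 25),
    university     = c(10, 40,  0, 20, 25)
)

# Average share of learning attributed to each category.
mean_share <- colMeans(learning)

# Share of respondents who report any learning on the job.
pct_learned_at_work <- mean(learning$work > 0) * 100

print(mean_share)
print(pct_learned_at_work)
```

The first quantity describes how much each source contributes on average, while the second counts how many respondents used a source at all, which is the kind of figure quoted for on-the-job learning.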

Common Job Algorithms

Common Algorithms and Methods Used by Data Scientists

INTRODUCTION

In this section, we explore the commonly used algorithms and methods that are presumably required as basic skills in the data science field.

CONCLUSION

It appears that, on average, data scientists use at least 3 algorithms and 7 methods in their work. As the bar graph above shows, the most commonly used algorithms and methods are as follows:

  • Algorithms
    • Regression/Logistic Regression (15.65%)
    • Decision Trees (12.96%)
    • Random Forests (11.7%)
  • Methods
    • Data Visualization (8%)
    • Logistic Regression (6.83%)
    • Cross-validation (6.74%)
    • Decision Trees (5.93%)
    • Random Forests (5.63%)
    • Neural Networks (5.28%)
    • Time Series Analysis (5.03%)

An average data scientist is able to use the algorithms and methods listed above as basic hard skills to meet standard industry expectations. An exceptional data scientist may be capable of handling 7 to 30 methods and 4 to 15 algorithms.
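
The per-respondent counts and selection shares discussed above could be derived along the following lines. This is a minimal base R sketch on made-up answers: the vector name and its contents are hypothetical, a plain comma split is assumed to be sufficient, and the percentages are assumed to be shares of all selections made.

```r
# Hypothetical multi-select answers, one comma-separated string per respondent.
algorithms_used <- c(
    "Regression/Logistic Regression,Decision Trees,Random Forests",
    "Regression/Logistic Regression,Neural Networks",
    "Decision Trees,Random Forests,Bayesian Techniques,SVMs"
)

# Split each answer into its individual selections.
selections <- strsplit(algorithms_used, ",\\s*")

# Number of algorithms each respondent selected, and the average across respondents.
n_per_respondent <- lengths(selections)
mean_algorithms <- mean(n_per_respondent)

# Share of all selections going to each algorithm, most common first.
all_selections <- unlist(selections)
selection_share <- sort(table(all_selections), decreasing = TRUE) /
    length(all_selections) * 100

print(n_per_respondent)
print(mean_algorithms)
print(round(selection_share, 2))
```

Dividing by the total number of selections, rather than by the number of respondents, makes the shares sum to 100% across all algorithms.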

Furthermore, the most commonly used dataset size appears to fall in the 1 GB to 10 GB range (more than 50%). For reference, the last graph displays the most used methods by dataset size.

Work Tools Frequency

Frequency of Use for Various Tools by Data Scientists

CONCLUSION

Work Challenges

Challenges Faced by Data Scientists

In this section, we address the challenges faced by data scientists and how their time is typically spent at work.

CONCLUSION

Conclusion

Skill List

Data Scientists…

HARD SKILLS

  1. List

SOFT SKILLS

  1. learn from diverse sources.
  2. continue to learn even after they have secured a job.

In conclusion…

SQL

The original Kaggle data was in an untidy form. As part of data preparation, we each created tidy data subsets and saved them to a series of CSV files stored on our GitHub. The following SQL script will import them into a series of tables. We hope that this will aid future research and help to find connections that we may have missed.

holding